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RECURSIVE MONTE CARLO FILTERS: ALGORITHMS 
AND THEORETICAL ANALYSIS 

By Hans R. Kunsch 
ETH Ziirich 

Recursive Monte Carlo filters, also called particle niters, are a 
powerful tool to perform computations in general state space mod- 
els. We discuss and compare the accept-reject version with the more 
common sampling importance resampling version of the algorithm. 
In particular, we show how auxiliary variable methods and strati- 
fication can be used in the accept-reject version, and we compare 
different resampling techniques. In a second part, we show laws of 
large numbers and a central limit theorem for these Monte Carlo fil- 
ters by simple induction arguments that need only weak conditions. 
We also show that, under stronger conditions, the required sample 
size is independent of the length of the observed series. 

1. State space and hidden Markov models. A general state space or hid- 
den Markov model consists of an unobserved state sequence (Xt) and an 
observation sequence (Yt) with the following properties: 

State evolution: Xq, X\, X2, ... is a Markov chain with Xq ~ ao(x) dfj,(x) 
and 

X t \X t -i =x t -i ~ at(x t -i,x) d(J,(x). 

Generation of observations: Conditionally on (Xt), the Yf's are indepen- 
dent and Yt depends on Xt only with 

Y t \X t = x t ~ b t (x t ,y) dv(y). 

These models occur in a variety of applications. Linear state space models 
are equivalent to ARMA models (see, e.g., [16]) and have become popular 
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under the name of structural models (see, e.g., [17]). Nonlinear state space 
models occur in finance (stochastic volatility; see, e.g., [27]), in various fields 
of engineering (speech, tracking and control problems; see, e.g., [12]), in bi- 
ology (ion channels, DNA and protein sequences) and in geophysics (rainfall 
at a network of stations, data assimilation). A more detailed survey with 
many references is given in [20]. 

In order to apply these models, two kinds of problems have to be solved: 
Inference about the states based on a stretch of observed values y s -.t = 
(]Ju,s < u < t) for a given model, that is, at and 6j known (this is called 
prediction, filtering and smoothing), and inference about unknown parame- 
ters in at, bf. From a statistical point of view, the latter problem is maybe 
of greater interest, but fast and reliable algorithms for the former are a pre- 
requisite for computing maximum likelihood or Bayesian estimators. The 
reason for this is briefly mentioned in Section 2.1. This paper is therefore 
entirely devoted to algorithms for filtering, prediction and smoothing. 

Section 2 recalls the basic recursions for filtering, prediction and smooth- 
ing. Section 3 discusses algorithmic aspects of sequential Monte Carlo meth- 
ods to implement these recursions. Most algorithms in the literature, begin- 
ning with the pioneering paper by Gordon, Salmond and Smith [15], use the 
sampling importance resampling idea of Rubin [26] . An exception is Hiirzeler 
and Kiinsch [18] who use the accept-reject method instead. Here we show 
how some ideas like stratification and an auxiliary variable method of Pitt 
and Shephard [23] can be adapted to rejection sampling, and we give new 
results on the performance of systematic resampling methods. In addition, 
we hope that our view of classifying and comparing approaches is useful. 

Section 4 presents results on the convergence of the method as the num- 
ber of Monte Carlo replicates tends to infinity. We discuss both laws of large 
numbers and a central limit theorem. Recently, many similar results have 
been published; see, for example, [4, 8, 21]. The distinctive features of our 
presentation here are the weakness of conditions, the use of the total vari- 
ation distance to measure the difference between the approximate and the 
true filter density and the simplicity of the techniques used. We basically 
show that most results follow by induction, in accordance with the recur- 
sive nature of the algorithm. The complications that occur are due to a 
counterintuitive property of Bayes' formula; see Lemma 3.6(h) in [20]. As a 
consequence, although one can obtain consistency with very few conditions 
on the model, the required sample size seems to grow exponentially with the 
number of time steps. For results that guarantee that the required sample 
size is independent of the number of time steps (or grows at most logarith- 
mically), one has to use induction over several time steps which requires 
rather strong conditions on the dynamics of the states. At the end, we give 
some results for the case where (Xt) is a continuous time process and the 
sampling rate of the observations increases. 
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2. Filtering and smoothing recursions. In general, we will use the symbol 
p as the generic notation for a conditional density of its arguments. However, 
for the conditional density of Xt given Y\ ■ s = y\ . s , we use the notation 
ft\s( x t\yi-. s)- The three cases s <t, s = t and s > t are called prediction, 
filtering and smoothing, respectively. 

The dependence structure of a state space model can be represented by 
the following directed acyclic graph: 

■ • • >■ X t -\ > X t > X t+ \ 



y 

Y t -! Y t Y t+1 



From this, various conditional independence properties follow which are used 
together with the law of total probability and Bayes theorem to derive recur- 
sions for the filter, prediction and smoothing densities. These are well known; 
see, for example, [20], Section 3.3, and we state them without proofs. 

The most important result is the following recursion for the filter density: 

Propagation. From the filter density, we obtain the one-step-ahead pre- 
diction density: 

(1) ft\t-l(xt\yut-l) = J ft-i\t-i(x\yi:t-i)at(x,x t )dfi(x). 

Update. From the-one-step ahead prediction density, we obtain the filter 
density one time step later: 

, i , ft\t-i{xt\yi-.t-i)bt{xt,yt) 

It\t\ x t\yi:t) — -77 -7—, ttt \ t / \ 

J Jt\t-i{x\yi-t-i)Ot{x,y t )dfi(x) 

(2) 

oc ft\t~i(xt\yi:t-i)bt(x t ,yt). 

In parts of the literature, for example, in [8], Yt depends on Xt-i and not 
on Xt. Then the filter density is, in our setup, the prediction density which 
should be kept in mind when comparing formulae. 

2.1. Prediction of observations and likelihood. The denominator in the 
update step (2) is the conditional density of Yt given Yi : t_i: 



(3) P(yt\yi-.t-i) = J ft\t-i(x\yi:t-i)b t (x,y t )dn(x). 

If ft\t-i is available, we thus can obtain the likelihood from 

T 

p(yi:r) = Y[p(yt\yi:t-i)- 

t=i 
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A different representation of the likelihood is obtained by marginalization: 

- T T 

p{v\:t)= J a (xo)Y[ a t(xt-i,xt)bt{x t ,yt)l[[diJ,(xt). 

t=l t=0 

From this, the likelihood ratio can be expressed as an expectation with 
respect to the smoothing distribution; see, for example, [19]. 

2.2. Smoothing. The filter densities can also be used for the smoothing 
problem since conditional on y\ . t, (Xt, Xt-i, • • • , Xq) is an inhomogeneous 
Markov chain with starting density /tit an d backward transition densities 

(4) p(x t \x t+ i,y 1 . T ) = p(x t \x t+ i,y 1 . t ) tx a t+ i(x t , x t+1 )f t \ t (x t \y x . t ). 

This is also the basis for the forward-filtering-backward-sampling algorithm; 
see [13], equation (20). From (4), we can derive, in particular, a backward 
recursion for f t \x- 

2.3. Recursive filtering in operator notation. A compact notation for the 
filter recursion which will be useful later on is 

(5) ft\t(-\yv.t) = B(A*ft-i\t-i(-\yi-.t-i),bt(-,yt))- 

Here 



A tf( x ) = J f(x')at(xf,x)d(i(x') 
is the Markov transition operator, and 

B(f,b)(x) - 



I f(x)b(x) d(i(x) 

is the Bayes operator that assigns the posterior to a prior / and a likelihood 
b. The operators A$ and B(-,b) map the space of densities into itself, but 
they can be extended to the space of probability distributions. 

2.4. Implementation of recursions. If X t is discrete with M possible val- 
ues, integrals are sums and the recursions need 0(TM 2 ) operations. In a 
linear Gaussian state space model, all f t \ s are Gaussian, and their means 
and variances are computed with the Kalman filter and smoother. 

In practically all other cases, the recursions are difficult to compute. An- 
alytical approximations like the extended Kalman filter are not satisfactory, 
and numerical integration is problematic in high dimensions. Much current 
interest focuses on Monte Carlo methods. Standard Markov chain Monte 
Carlo can be used, but it lacks a recursive implementation. There has been 
considerable interest in recursive Monte Carlo methods in recent years; see, 
for example, [12]. 
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3. Algorithms for recursive Monte Carlo filtering. The following is the 
key observation: A\f is difficult to compute, but easy to sample from if 
we can sample from / and ctt(x,-). This allows us to generate recursively 
a sequence of samples ("particles") {xj.uj = 1, ■ • • , N, t = 0, 1, . . . ) with ap- 
proximate distribution f t u as follows: If (xj.t-i) is available, we can replace 

^/t-i|t-i(a;t|yi:t-i) = j ft-i\t-i(x\yi:t-i)at(x,x t )dn(x) 

by 

1 N 

— ^atixjj-uxt). 
j'=i 

Therefore, we sample (xjt) from the distribution with density 

1 N 

(6) f* t (-\yi :t )(xb t (-,y t )—J2 a t( x j,t~i,-)- 

i=i 

In this section we discuss methods to sample from such a density. We simplify 
the notation somewhat and write the target density as 

N 

(7) f N (x)*f?(x) = b(x)Y,a(j,x) 

3=1 

(subscript u for unnormalized) . We will call b the likelihood and iV -1 £^ ■ a(j, x) 
the prior. In the filtering context, the prior is the approximate prediction 
density. For later use, we also introduce 



/%= / a(j,x)b(x)dfj,(x), 

which is in the filtering context equal to the conditional density of Yj given 
Xt-i = Xj,t-i- We assume that we have good methods to generate sam- 
ples from a(j, •) for any j. The methods we discuss fall into two categories: 
accept-r eject and importance sampling with an additional resampling step. 

3.1. Accept-reject methods. The accept-reject method for sampling from 
the density (7) produces values X according to a proposal p, and if X = x 
accepts it with probability 

f N (x) 

Here M is an upper bound for the ratio f^{x)/p{x): 

M >sup^C^- 

x p[x) 
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The most obvious proposal p(x) is the prior, that is, 

1 N 

(9) p(x) = jj^2a{j,x). 

3=1 

Then the evaluation of the acceptance probabilities ir(x) is easy as long as b 
is bounded. In order to sample from (9), we first choose an index J uniformly 
from {l,...,iV}, and given J = j, we sample X from a(j,x). Note that, in 
this case, the densities a(j,x) need not be available in analytic form; we 
only have to be able to sample from them. This is of interest in discretely 
observed diffusion models. 

The average acceptance probability of this algorithm is / p(x)n(x) dp(x) = 
J2j Pj/M. In particular, if p is the prior and if we use the smallest value of 
M, it is equal to 



Nsap x b(x)' 

This is low if the likelihood is more informative (concentrated) than the 
prior, or if the likelihood and the prior are in conflict. We discuss here some 
modifications and tricks that can alleviate this problem in some situations. 

3.1.1. The mixture index as auxiliary variable. Other proposal distribu- 
tions than the prediction density can, of course, lead to higher acceptance 
rates, but usually it is difficult to compute a good upper bound M, and 
the evaluation of the acceptance probability ir(x) is complicated due to the 
sum over j. A way to avoid at least the last problem is based on an idea by 
Pitt and Shephard [23]. Namely, we can generate first an index J according 
to a distribution (jj) and given J = j, a variable X according to a density 
p(J,x). We then accept the generated pair (j,x) with probability 

( 10 ) 7T(j, X) = — r— , 

MTjp(j,X) 

where now 

a(j, x)b(x) 
M>sup u ' / Y - 

j,x Tjp{j,x) 

If the pair is accepted, we simply discard j and keep x, and otherwise we 
generate a new pair. Because the accepted pairs (J,X) have distribution 

a(j,x)b(x) 

' 

the marginal distribution of X is the target (7). If we take tj = 1/N and 
p(j, x) = a(j, x), we obtain the usual algorithm discussed before, but one will 
try to increase the acceptance rate by other choices. 
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Because j runs over a finite set, we will usually take 

Mj , Jjr a(j,x)b(x) 

M = max — - , where Mj > sup ■ 



J Tj x p{j,x) 

Lemma 1. For a given choice of densities p(j, x) and bounds Mj, the av- 
erage acceptance probability is less than or equal to J2 ftj/J2Mj, with equality 
iff Tj oc Mj . 

Proof. The average acceptance probability is 

j j j T 

Clearly, 

M k M k ^ 
max = > To max > > M,- , 

k T k j k T k ~ j 

with equality iff M^/rk is constant. □ 

If p(j,x) = a(j,x), the optimal t^'s are thus constant. This is somewhat 
surprising since one could conjecture that it is better to give higher proba- 
bility to those indices j for which the mass of a(j,x) is close to argsupft(x). 

The crucial point in implementing this algorithm is the choice of the 
densities p(j,-)- Lemma 1 implies that, for a high acceptance probability, 
all Mj's should be small, that is, each p(j,x) should be a good proposal 
distribution for the density a(j , x)b(x) / (3j . Ideally, we would choose that 
density itself. But then Mj must be close to the normalizing constant (3j 
which typically is not available in closed form. A more practical approach 
chooses a parametric family (p(6,x)), where we have available tight upper 
bounds 

, ■ „s a(h x)b(x) 

M(j, 9) > sup ■ v ' 



p(9,x) 

We then optimize over 9, that is, 

p(j> x ) = p(Qj > x ) ; where 9j « arg min M(j, 9) . 

Note that it is not necessary to find the optimal 9 exactly, but M(j, 9) should 
be a true upper bound. By choosing the family (p(9, x)) such that it contains 
all densities a(j,x), we can make sure that the acceptance probability is at 
least as high as with the usual algorithm. 
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Example. The simplest stochastic volatility model, see, for example, 
[27], is obtained if we take for a(j, •) the normal density with mean rrij and 
variance a 2 and for b the likelihood of a AA(0,exp(x)) random variable Y, 

b(x) = b(x, y) = ex P (~ | ~ y exp(-x)^ . 

If we choose as p(9, •) the normal density with mean 9 and variance a 2 , we 
can compute the supremum of 

a(j, x)b(x) x 9 — in- y 2 rn 2 - — 9 2 

log 7T — : — = 7^-x exp(— x) — k — 

6 p(0,x) 2 a 2 2 yy ! 2a 2 

over x. It is equal to 

Y 62 + ™j S - Q + S^j (1 + logy 2 ) + Q + s\ log(l + 25), 

provided 5 = {9 — rrij) / a 2 > —1/2 (otherwise the function is unbounded 
above). Minimizing this expression with respect to 5 subject to 5 > —1/2 
leads to a nonlinear equation which has no closed form solution. Using 
log(l + 25) < 25, we obtain a quadratic upper bound which is minimized 
by 

9j = rrij + y max^-1, (logy 2 - rrij)^ . 

This choice of 9j may be slightly suboptimal, but because the bound is sharp 
for 9 = rrij, that is, 5 = 0, we still can guarantee a higher acceptance proba- 
bility than with the usual method. In practice, the gain can be dramatic if 
|y| is small. 

The above choice of 9j is somewhat different from the suggestion 

9j = rrij + y (y 2 exp(-m i ) - 1) 
in [27], page 285. In addition, the choices for Tj differ. 

3.1.2. Balanced sampling. Besides reducing the acceptance rate, we can 
also try to reduce the variance by using a more balanced sampling: The 
target / is a mixture of N components, and the variance is reduced if the 
different components in the mixture are represented with the correct pro- 
portions. This idea has received much attention in the sampling importance 
sampling context; see Section 3.2.1 below and the references given there. We 
have not seen this idea in the accept-reject context. Consider the estimation 
of 



m (V ; ) = / / (x)ip(x) dp(x) 
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where 

f il>(x)a{j,x)b{x) dn(x) 

and V> is a bounded "test function." If (Xj) is an i.i.d. sample from / , the 
estimator 

"HW = ^ 

has variance 



N ^ N EjPj 

A method to reduce this variance replaces the random selection of an index J 
by a more systematic procedure. Namely, we can propose simultaneously iV 
values, one each from the density a(j,x), and decide whether to accept each 
of them independently. We repeat the procedure until the total of accepted 
values is at least N. If we need exactly N values, we can select them at 
random. We therefore consider the estimator 

Th=1 Ej=i ^{Xij^lUijKbiXij)] 



fh(ifj) 



Ei=l Ej=l l[f7y<6(Xy)] 

where (Xy , U^; 1 < j < N, i = 1,2, . . . ) are independent random variables 
with Xij ~ a(j, •), Uij uniform on (0, sup6(x)), and T is the smallest integer 
such that the denominator is at least N. 

In order to compute the variance of fh(ip) approximately, we use 

- / n / n Ef=i EjLi^(Xij) - m(^))t {u <KX )} 

Ei=lEj=l IL [i7 ii <6(X ii )] 

For simplicity, we assume that sup6(x) = 1. Then, by Wald's identity the 
denominator has expected value 

N 

E[T]J2(3r 

i=i 

In particular, the expected number of random variables that have to be 
generated is essentially the same as with basic i.i.d. rejection sampling. Sim- 
ilarly, the numerator has expectation zero and variance 

N 

E(r)EVar((V(X lj )-m^))l [C 7 li <6(x li )]) 



E(T) a 2 ^) J2 Pj ~ E^ 2 K W " ™W0) 



2 
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Assuming the denominator to be approximately constant and equal to N 
(which is reasonable if the expected number of accepted values in each round 
of proposals is small), we obtain the approximation 



The second term thus quantifies the gain of the method. 

3.2. Sampling importance resampling. This method generates (z^; 1 < 
k < R) according to some proposal p and selects from these a sample of size 
./V with inclusion probabilities 



The resampling need not be made at random. We will discuss below alter- 
native methods with reduced variability. The standard proposal is again the 
prior (9), leading to the original proposal in [15]. 

This method has difficulties if the sampling probabilities ir(zk) are very 
unbalanced since this leads to many ties in the final sample. Typically, this 
occurs in situations where the prior and the likelihood are in conflict, that 
is, when the acceptance rate in rejection sampling is low. Choosing R much 
bigger than N reduces the number of ties, but at the expense of longer com- 
putations. Note that rejection sampling is an automatic way of choosing R 
such that all ties are avoided. There are also possibilities for reusing the 
rejected variables for estimating the current filter distribution more accu- 
rately; see Section 3.3.3 of [25]. 

Most of the ideas discussed in connection with rejection sampling can also 
be used here. The idea of Pitt and Shephard [23] to include explicitly an 
index J was originally developed for this case. It proposes a sample (Jk,Zk) 
of size R with distribution Tjp(j, x) and then selects a sample of size iV with 
inclusion probabilities 



In contrast to rejection sampling, combining p(j,-) =q(j>") with unequal 
Tj's is a promising idea here. For instance, we can take tj to be proportional 
to b(mj), where rrij is the mean or the median of a(j, •). If all a(j, -)'s have 
a small spread (relative to the scale at which b varies), then most ir(zk, jfc)' s 
will be approximately equal and therefore R = N is sufficient. 
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3.2.1. Balanced sampling. Both in the proposal and in the resampling 
step, we have to select indices from a given distribution. In the former case, 
this distribution is {jk) and in the latter (ir(zk))- Balanced sampling is easy 
to implement and often can reduce the variance substantially. In the recur- 
sive implementation for filtering, we can combine the resampling step at the 
end of the current iteration and the selection of the index at the beginning 
of the next iteration into a single selection of indices. In order to keep the 
notation simple, we discuss the ideas in the context of the resampling step 
only. We denote the number of times the index k is selected by N/.. For ran- 
dom resampling with replacement, these multiplicities (Nk) are multinomial 
(N,(n(zk)))- Here, we look for more systematic sampling procedures. We 
require that J2Nk = N and that sampling is unbiased, that is, 

E[N j \z 1 ,...,z R ]=Nir(z j ). 

Then the estimator 

1 R 



3=1 

has the same expected value as the usual importance sampling estimator 

R 

5=1 



Its variance can be written as 

/ R 

Var(m(V0) = Var ^ VK^X^j 
\j=i 




where Cn(i,j) is the conditional covariance of Ni and Nj. The first term 
is the variance of the usual importance sampling estimator and the second 
term is the additional variability due to the resampling step. The advan- 
tage of resampling becomes apparent only when we consider several time 
steps: Without resampling, the recursive filter sample would quickly degen- 
erate, that is, practically all the weights would be given to very few values. 
Resampling splits the particles with large weights into several independent 
ones and kills some of the particles with very small weights. Nevertheless, we 
should try to minimize the additional variability introduced by resampling. 
Because it is not known in advance which functions ip will be of interest, we 
consider the supremum over all (bounded) test functions ijj. 
With multinomial iV,-'s, we have 

"i/>{z i )i/>(z j )c R (ij) = N(Y,Mzi)Mzi)- fe^teM*)! ) 



hi 



< N supifj(x) 
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Hence, resampling randomly with replacement can guarantee that the effect 
of resampling disappears asymptotically. 

Several methods have been proposed which reduce the (conditional) vari- 
ances Cn(i,i). Residual sampling [22] takes 

Ni = [Nir(zi)] + (N!) ~ multinomial (N', (V («())), 

where [x] denotes the integer part of x and 

Nn(zi) - [NTr(zi)] 



N' = N-J2[N7r( Zi )], 7r'(zi) 



N> 



This reduces ^2 it jip(zi)ip(zj)Cii(i,j) by the factor N' /N. Intuitively, we ex- 
pect the fractional part Nir(zi) — [Nir(zi)] to be uniform on (0,1), leading 
to an average reduction by a factor of two. 

The variance Cji(k, k) is minimal iff is equal to one of the two integers 
closest to Nir(zk)', see [6]. Together with the condition E[iVfc] = Nir(zk), this 
determines the marginal distribution of Nk- Crisan and Lyons [6] show that 
then also the expected relative entropy between the empirical distribution 
(Nk/N) and the target (Tv(zk)) is minimal. There are at least two algorithms 
such that all N^s differ by less than one from NTr(zk). The following one 
goes at least back to Whitley [28] and has been rediscovered by Carpenter, 
Clifford and Fearnhead [2]: 



(12) N. 



3k 



k-1 k 



Nj2<^) + U,Nj2<z Ji ) + U)n{l,2,...,N} 



where (ji,j2, • ■ • ,3r) is a random permutation of (1, 2, ... ,12), U is uniform 
on [0,1), and the absolute value of a finite set denotes the number of elements 
in this set. 

The second algorithm has been proposed by Crisan, Del Moral and Lyons 
[5]; see also [4]. One chooses an arbitrary binary tree with R leaves, labelled 
as 1, ... ,12, and one propagates iV particles from the root in a specific way 
down the tree. The value of Nj is then the number of particles ending at 
leaf j. In order to describe the propagation, we identify a node a with the 
subset of {1, . . . ,12} that consists of the leaves connected to a. Furthermore, 
we denote by N a the number of particles that pass through a node a. The 
expected value of N a must then be equal to fi a = N^j^a^j- The splitting 
at each node is done such that N a differs by less than one from fi a and 
EfiVo,] = fj, a . It is easy to see that this can be achieved: Each split is either 
deterministic or chooses between two possibilities with given probabilities. 
Decisions at different nodes are made independently. 

However, by minimizing Cn(k, k), we usually introduce strong dependence 
between different Nj's, and the effects of this are hard to control. Trivially, 
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\C R (i,j)\ < 1/4, but the bound 



N 2 

i,j X 

contains no useful information because it does not even allow one to conclude 
that the additional uncertainty due to resampling disappears asymptotically. 

With Whitley's algorithm [28], C R (i,j) can be either positive or negative. 
Since we know nothing about the sign of i/j^z^), I do not see how one could 
obtain a better worst case bound. Still, the following lemma supports the 
conjecture that, on average, the algorithm (12) will behave well. 



Lemma 2. For arbitrary probabilities (iri) and arbitrary N , consider the 
random variables 



3-t 3 \ 

v^TTi + c/; wy>i + cn n {i,2, 



=i 



,N} 



where U is uniform on (0,1) [this is the algorithm (12) without the ad- 
ditional permutation]. Then for any j < k, Cov(Nj,Nk) depends only on 
T\ = Nit j modi, r u = ./Vvrfcinodl and r m = NJ2iZj+i 7r i m ° ( i^; an explicit 
expression being given in the proof. Moreover, the average of this covariance 
with respect to the uniform distribution on (0, 1) for r m is zero for all values 
r% and r u . 



Proof. Because shifting a uniform random variable modulo 1 does not 
change the distribution, we may assume that j = 1. Moreover, it is clear that 
only the fractional parts ri,r m ,r u matter. If we put Mj = Nj — [Nitj] and 
Mfc = Nk — [Ntt^] , we obtain therefore 

E[MjM k ] = P[U G (0, n) n (n + r m - 1, n + r m + r u - 1)] 

+ P[U € {0,ri)n(ri+r m -2,ri + r m + r u -2)}. 
It is easy to evaluate the right-hand side by distinguishing different cases: 



' {n+r m + r u - 1)+, (n +r m <l,r m + r u < 1), 
r u , (n + r m >l,r m + r u <l), 

n, (n+r m <l,r m + r u >l), 

l-r m , (n + r m > 1, r m + r u > 1, 

ri+r m + r u < 2), 
ln-|-r- u -l, (n +r m +r u > 2). 



It is also easy to show that, by integrating over r m £ (0, 1), we obtain rir u 
in all cases. □ 
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Although the method is unbiased and has minimal variance without ran- 
domizing the order of the values, it seems wise to do so since it is com- 
putationally easy and we expect it to make the values r m approximately 
uniform. 

With the algorithm of Crisan, Del Moral and Lyons [5] we have control 
over the sign of Cr{j, k). 

Lemma 3. For the tree-based algorithm, we have, for arbitrary nonde- 
creasing functions f\, . . . , fn, 



E 



R 



<II E [/i(^ 



In particular, the covariances Cn{i,j) are negative for i^j. 

Proof. Denote the two nodes connected directly to the root by a and 
(3. Because the particles are propagated independently in the two subtrees, 



E 



l[fj(Nj)\N a ,Np 
■i=i 



E 



j'6Q 



E 



Lie/3 



Furthermore 
E 



Uf j (N j )\N a = [ l i c 
.jea 



< E 



Uf j (N j )\N a = [ fJ , a ] + l 

.jea 



since we can propagate first \p, a ] particles and afterward an additional par- 
ticle. Because N a and Np are negatively dependent, we obtain 



E 



< E 



urn 

.jea 



E 



The proof proceeds now recursively, by considering in the next step each 
factor separately and conditioning on the number of particles one level lower. 
□ 

This lemma implies that the additional variance due to resampling is 
reduced by a factor of at least two compared to multinomial sampling: 



Lemma 4. For the tree-based algorithm described, 



1 



^( Zi )^Zj)C R (i,j) < -$>(*) 



'■j 



< "TT SUP 1p(x) . 
£ x 
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Proof. Write ip as the difference of positive and negative parts and use 
Cauchy-Schwarz for the covariance between positive and negative parts; see 
also [4], page 31. □ 



For later use we formulate and prove the following large deviation inequal- 



ity: 



Lemma 5. For the tree-based algorithm, 



sup P 

Ac{l,...,R} 



> £ 



<2exp(-4e 2 /^)- 



Proof. For any t > 0, we have 



.j&A 



< exp(-te)E 



exp it^Ni-Nnj) 
\ jeA 



By Lemma 3, 



E 



exp 



jeA 



jeA 

Because Nj takes only two values, 

E[exp(t(iV J - - NiTj))] = exp(-trj)(l - r, + exp(t)rj), 

where r, = Nttj — [Nttj]. By the standard argument in the proof of Hoeffd- 
ing's inequality, the right-hand side can be bounded by exp (t 2 /8). Hence, 



<exp(-te+\A\t 2 /8), 



which is minimal for t = 4e/\A\. The probability of a deviation less than or 
equal to —e can be bounded by the same expression. The lemma then follows 
because we may assume \A\ < R/2 (the deviations for A and A c differ only 
in sign). □ 



3.3. Accept-reject versus sampling importance resampling. Generally speak- 
ing, the computational effort for rejection sampling is greater than for sam- 
pling importance resampling. By how much depends, however, on the specific 
situation. Note that with the auxiliary variable idea of Pitt and Shephard 
[23], it is possible to use the rejection method even in cases where the likeli- 
hood b is unbounded, for example, in the stochastic volatility model of Sec- 
tion 3.1.1 with y = 0. For both methods one needs to find proposal densities 
p(j,x) that approximate a(j,x)b(x), but for rejection sampling one needs, 
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in addition, an upper bound for a(j , x)b(x) / p(j , x) , which can be difficult in 
high dimensions. 

Usually, a large empirical variance of the inclusion probabilities ir(zk) is 
taken as an indication that the error of sampling importance resampling is 
large. However, a low variance does not guarantee a low error. When the 
true filter density is bimodal and if the proposal represents only one mode 
well, then the inclusion probabilities are fairly balanced unless the sample 
size is huge. If we are unable to compute the modes of the filter density, then 
rejection sampling is presumably the only way to obtain some guarantee for 
the algorithm in such a case. 

The results of the next section allow some theoretical comparison of re- 
jection and sampling importance resampling methods. We will show in Sec- 
tion 3.1.1 below that rejection sampling has a smaller asymptotic variance 
than the standard sampling importance resampling algorithm. Another rel- 
evant question is whether the errors of the methods depend on sup x bt(x,yt) 
(if they do, then it is not clear how much one gains by an algorithm which 
does not need a bound on this supremum). For the rejection method, both 
the exponential bounds in the law of large numbers and the asymptotic vari- 
ance do not depend on bt at all as long as the condition (20) is satisfied. For 
sampling importance resampling, our best bound in the law of large numbers 
depends on sup^, bt (x, yt) because of Lemma 9. The bound on the asymptotic 
variance does not involve the supremum of &t, but a certain L2-norm of bt- 



3.4. Computation of the likelihood. Combining (3) and (6), we see that 

N 1 

p{yt\yi-.t-i) / J^ a t( x j,t-i,x)b t (x,y t )dn(x), 

which is in the short notation of this section equal to If we use 

Tjp(j,x) as our proposal, then the usual importance sampling estimator of 

p(yt\yi-.t-i) is 

p{vt\vi:t-x) = tt^ 2^ r — r- 



3.5. Monte Carlo backward smoothing. There is a similar recursive sim- 
ulation method that generates samples from the conditional distribution 
of Xq-t given Y\-t = yi-.T- At time T, we use the recursive filter sample 
x s yY = Xj s T- We then proceed backward in time, using (4) together with 
an approximation of f t \ t . In order to avoid problems with discreteness, we 
recommend use of (6) as in [18], instead of replacing f t u by the empirical 
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distribution of the particles at time t as in [14] . This means that we generate 
x sm f rom Xj^i and (xij-i) by simulating from the density proportional to 

1 N 

(13) ett+i (x, XjJ +1 )b t (x, Vtjj-fYl, a t{xi,t-i,x). 

i=i 

[At time t = 0, we will use the density proportional to ai(x,Xj™)ao(x).] 
Clearly this has the same structure as (7) and so the same methods as 
discussed before apply in principle. However, we need one value from the 
density (13) for each j and thus sampling importance resampling does not 
seem to be useful here. For the same reason, care is needed when using the 
mixture index as an auxiliary variable. Since sampling from (ji) typically 
involves computing the partial sums of the Tj's, one should use the same 
distribution {n) for all j. Then the computational cost of the approach is 
0(TN) and thus at least comparable to a standard MCMC method. The 
main disadvantage of this approach is that we have to store all the filter 
samples. 

4. Theoretical properties. In this section we analyze the convergence 
of the approximation f^ t to the true filtering density f t \ t . We will hold the 
observations y\ . t fixed and drop them from the notation. In particular, we do 
not make any assumption about how the observations were obtained. The 
true filtering densities f t \ t are then deterministic, but the approximations 
f^ t are still random since their computation involves random sampling. All 
expectations and probabilities in this section concern the randomness of the 
Monte Carlo methods, and not the randomness of the state space model. We 
assume throughout that Xt takes its values in a complete, separable metric 
space equiped with the Borel cr-field, and we denote the metric on this state 
space by d(-, •). 

The operator notation for recursive Monte Carlo filters introduced in Sec- 
tion 2.3 will be used extensively. In addition, we denote by E^{f) the empir- 
ical distribution of a sample of size N from /. Then the approximate filter 
density is 

ft\t = B { A *t E N{ff_i\ t _i)A{-,yt))i 
and by (6) and (5) it has to be compared with 

ft\t = B{A* t f t _ l \ t _ 1 ,b t (-,yt))- 

In the first two sections we present two approaches for showing conver- 
gence of f^ t to fq t as N — > oo. We measure the error by the Li-distance 
between densities, see, for example, [9], Chapter 1, which can be written in 
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several equivalent forms: 

II/- Sill = / \f(x)-g(x)\dfi(x)=2 f(f(x)-g(x)) + d^x) 

(14) \ 

= 2 sup | Pf [C] -P g [C]\ = 2 (f(x) - mm(f(x),g(x))) dfi{x) 
c J 

{x + denotes the positive part of x). Clearly, if ||/^ — f t \t\\i converges to zero 
in probability or almost surely, then for any bounded function ip on the state 
space, the law of large number holds: 



1 N f 



in probability or almost surely. In the third section we show the correspond- 
ing central limit theorem. 



4.1. Stepwise error propagation. The obvious first attempt to show con- 
vergence uses the decomposition 

/$ - f t{t = B{A* t E N Uli\t-x)M) ~ B(4?/£i|t-i>&t) 

(15) 

+ B(A* t f t N _ llt _ 1 ,b t )-B(A* t f t _ 1]t _ 1 ,b t ). 

The first term is the error due to sampling at time t — 1 (propagated once) 
and the second term is the propagation of the error at time t — 1 . For a recur- 
sive inequality for \\fq t — ft\t\\ii we have to study the Lipschitz-continuity of 
Bayes and Markov operators with respect to the Li-distance and to control 
the sampling error. 

The continuity of Markov operators is well known; see [10], Section 3. 

Lemma 6. We have 

\\A*f-A*g\\ 1 <p(A*)\\f-g\\ 1 , 

where 

p(A*) = \ sup ||o(x, •) - a(x', -)||i < 1. 

x,x' 

Note that, for a compact state space, the Markov operator is typically 
contracting. 

The continuity of Bayes' formula with respect to the prior is more prob- 
lematic. We have, see [20], Lemma 3.6(i), the following: 
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Lemma 7. 



\B(f,b)-B(g,b)\\i<P{f,b)\\f-9\ 



where 



P(f,b) 



sup x b{x) 



fb(x)f(x)d[i(x) 



el, 



sup^, b(x) 



infaj 6(x) 



The difficulty is that this bound cannot be improved in general. Lemma 
3.6(h) from [20] shows that the Bayes operator is not contracting for any /, 
at least for some "directions" g. 

Finally, we have the following bound on sampling errors: 

Lemma 8. If x — > a(x, •) is continuous with respect to the L\-norm, then 
under i.i.d. sampling from g, 

J>[\\A*E N {g)-A*g\\ l >e] N -^U 

exponentially fast in N for any e > 0. The convergence is uniform for all g 
such that f K g dfi > 1 — e/6 for some fixed compact set K . 

Proof. The proof follows closely the arguments in [9], Chapter 3. We 
denote by //jy the empirical distribution Ej\r(g) and by \x g the distribution 
g(x)d[i(x). 

Let e > be given. Choose a compact K such that fi g (K) > 1 — e/6. Next, 
choose 5 such that \\a(x, •) — a(x', -)||i < e/6 for all x,x' € K with d{x, x') < 5. 
Then choose a partition {B±, . . . , Bj} of K such that each Bj has diameter 
at most 5 and choose a point Zj in Bj for each j. Finally, put Bq = K c . Then 

j JV J 



i=l 



i=i 



, JV J i N 

— X 1 B Q (xi)a(xi,x) + J2 TT X 1 B ] (xi)(a(xi,x) - a(Zj,x)) 

j=i i=l 



dfi(x) 



J 1 N f 
< Mjv(-Bo) +X jyE^iW / l a ( x *' x ) -a(^',x)|d/_i(x) 



j=i i=i 



< |m jv (5 )-m s (5o)| + 3- 
Similarly, we obtain 

J 

/ a(x,-)g(x)dfi{x) -22n g (Bj)a(zj,-] 
J 3=1 



e 

< -. 

- 3 
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J J 

J2 V 9 ( B 3 ) a ( z 3 > • ) - J2 ^ N ( S i ) ° ( z J ' ' 
i=i j=i 



<x;i^(^)-/x 9 (^)|. 
i j=i 



Taking these three inequalities together, we obtain 

2e J 

\\A*E N {g) - A*g\\ 1 < - + £ IM^O " 

Hence, the large deviation estimate for the multinomial distribution, 
■ J 

P Y,\^{B j )-pL g {B j )\>\ 



< 2 J+2 exp(-iVe 2 /18) 



([9], Theorem 3.2), implies 

P[\\A*E N (g) - A*g\\ 1 > s] < 2 J+2 exp(-iV e 2 /18). 

From this, the lemma follows (note that, once K is fixed, J depends only 
on the transition kernel a and not on g). □ 

Theorem 1. If x — > at(x, ■) is continuous and if for all t, all x and all 

y, 

0<b t (x,y)<C{t,y)<oo, 
then for all t and all yi-t, 



\\ft\t~ ft\t\ 







in probability as N — > oo. 



Proof. The proof proceeds by induction on t. For t = 0, there is nothing 



to prove because f£ Q = / |o = clq. From Lemmas 6 and 7, it follows that 



B(A* t f t N _ llt _ v b t ) ~ B(A* t f t _ llt ^,b, 



< 



if 

(16) 



l|t— 1 



t)\\l 

C(t,y t ) N 

Pdtelyi:^!) 117 *- 1 "- 1 "^ 11 

p(yt\yut-i) 



<£ 



/t-l|t-l||l < e ' 



C{t,yt) 



-:6. 



By the induction assumption, there is an Ni such that, for N > Ni, (16) 
holds with probability at least 1 — rj. 
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In order to bound the first term in (15), some care is needed when applying 
the bounds provided by Lemmas 7 and 8 with which is random. 

We have to show that when (16) holds, we can obtain bounds which depend 
only on ft-i\t-i- Note first that 

6 t (x,y t )(A*/ f -i| t -i(x)-A*/^i|^i(x))d/i(x) 
>-^C(t,y t )||/^ 1|t _ 1 -/ t _ 1 | t _ 1 || 1 . 
Hence, if (16) is satisfied, 

b t (x,y t )A* t f^m^ix) dfi(x) > (1 - e/2)p(y t \y 1:t ^ 1 ) > \p{y t \yut-l) 
and, therefore, by Lemma 7, also 

T7I / fN \ A* fN II 

- '^(w-i) - ^* Wiiii- 

Next we observe that, if K is compact such that J K ft-i\t—i d^i > 1 — 5/2 
and if (16) holds, then J K f^_ l ^ t _ l dfi > 1 — 5. Therefore, by Lemma 8 we can 
find N 2 such that, for N > N 2 , 

(17) ll^C/tVi) " < 65 

holds with probability at least 1 — rj. Collecting all the bounds shows that, 
for N > max(Ni,N 2 ), 

||/#-/t|tl|i<13e 
with probability at least 1 — 2rj. □ 



The conditions of this theorem are weak. However, the arguments in the 
proof require Wf^^ - /t_i|t_i||i to be smaller than ||/|[ - / t | t ||i. This 
means that the required sample size N grows with t. It is easy to see that, 
in general, N has to grow exponentially with t, and, thus, from a practical 
point of view, the theorem is not of great use. Strengthening the assumptions 
by, for instance, assuming a compact state space, does not help because by 
Lemma 3.6(h) from [20], the Bayes operator is expanding. Hence, for a more 
useful result, we need a different approach which is provided in the next 
section. 
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4.1.1. Sampling errors for sampling importance resampling. The results 
so far have assumed that the Monte Carlo filter uses i.i.d. samples of f^ t , 
which means using the accept-reject method (with or without auxiliary 
variables). It does not cover sampling importance resampling. In order to 
extend the results above, we need to adapt Lemma 8 to the different sampling 
method. 



Lemma 9. Let g have the form g = B(h,b) and let (xi,Ni) be a sam- 
pling importance resample from g, that is, (xi) is an i.i.d. sample from h 
and the N^'s are the multiplicities in the resampling step which uses prob- 
abilities TTi oc b(xi). Assume that x — ► a(x, •) is continuous for the L\-norm, 
that sup6(x)/ f b(x)h(x) dfi(x) < oo and that 



sup P 

JC{1,2,...,N} 



> £ 



< c\ exp(— C2-/Ve 2 ). 



Then 



P[\\A*E N (g)- A*g\\ 1 >e) N -^?0 



exponentially fast in N for any e > 0. 

Proof. The assumption of i.i.d. sampling was used in the proof of 
Lemma 8 only to obtain an exponential bound for 

J 

J2\f i N(B j )-(ig(B j )\>^ . 

Hence, we have to obtain such a bound by different arguments. By Scheffe's 
theorem and Bonferroni's inequality, we have 
J 



^2\t*N(Bj)-lig(Bj)\>- 

lj=0 



< 2 J+1 supP 

B 



\ m (B)-^(B)\>- 



where the supremum is taken over all sets B in the cr-field generated by 
(Bq,Bi, ...,Bj). We can decompose via 

m (B)-» g (B) 

N 



E 

i=i 



Ni 
N 



1 



1 



v 



Jb(x)h(x)dn(x) 

N 



b(x)h{x) dfi(x) 



+ 



1 



1 



fb(x)h(x) d/i(x) \ N 



^2b(xi)l B (xi) 



i=l 



13 



b(x)h(x) dpi(x) . 
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The assumption on the resampling method gives an exponential bound for 
the probability that the first term is larger than e/18. Hoeffding's inequality 
provides analogous bounds for the second and third terms. □ 

Applying this lemma with h = N~ l J2j a i x j,t-2r) and b = & t _i(-, yt-i), 
we obtain the analogue of Theorem 1. The arguments in the proof of this 
theorem show that in this case b(x)/ J b(x)h(x) d/j,(x) is bounded. 

4.2. Analysis based on considering several steps. Clearly, we can look at 
error propagation over more than one time step. If we define 

K S)t {f) = K 8+ltt (B(A* +1 f, b s+1 )) (s < t), K tit (f) = f, 

then, for any s < t, f t \ t = K s j{f s \ s ) and, hence, 

ft%-ft\t= E (K r , t (B(A;E N (f r N 1{r ^),b r )) 

r=s+l 

(18) -K r , t (B(A* r f» llr _ v b r ))) 

Here the last difference is the error at time s propagated over t — s steps. The 
other differences are the errors due to sampling at time r — 1, propagated 
over t — r + 1 steps. 

This is only useful if we can give a bound on the error propagated over k 
steps which is better than the sum over k single steps. It is possible because 
an alternative way to get from / s i s to f t \ t is to apply first the Bayes operator 
once with likelihood equal to the conditional density of y s +i-.t given x s , 
followed by t — s Markov operators for the conditional transitions from x r 
to x r _|_i given y r +i-.t- The contractivity of the Markov operators can then 
beat the expansion of the Bayes operator. It requires, however, a uniform 
nontrivial upper bound for the contraction coefficient of the conditional 
chain given y r+ \-t, and for this, we need the following condition: There 
are probability densities h t and two constants < c a < C a < oo such that, 
for all x and x' , 

(19) c a h t (x) <a t (x',x) <C a h t (x). 

Condition (19) on at is reasonable when the state space is compact, al- 
though it is slightly stronger than uniform ergodicity. Using (14), we see that 
the lower bound of (19) alone implies p(A^) < 1 — c a and thus also uniform 
ergodicity. Condition (19) includes even some examples with unbounded 
state space. For instance, (19) holds for the model 

X t = g{X t - 1 ) + Vt 
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if g is bounded and Vt has a density whose logarithm is uniformly Lipschitz 
continuous. This is satisfied for most heavy-tailed distributions, but not for 
the Gaussian. For Gaussian Vt, (19) is false: There is no density ht such that 
the two bounds in (19) hold simultaneously. We thus have an example of a 
uniformly ergodic chain that we cannot treat with our arguments. 
Concerning bt, there is an almost minimal condition, namely, 

(20) < J b t (x,y t )h t (x)dfi(x) < oo 

for all t and all yt- Some arguments become much simpler, however, if we 
replace (20) by 

/ 01 x n b t (x,y) 

(21) C b := sup —— — r < oo. 

t,x,x',y h(x ,y) 

The following lemma shows that, under condition (19), the error propa- 
gated over several steps decreases exponentially. Many versions of this ex- 
ponential forgetting of the initial conditions of the filter have appeared in 
the literature; see, for example, [7, 8, 21] and the references given there. We 
use the version of [11], Lemma 1. 

Lemma 10. Assume conditions (19) and (20). Then for any two densi- 
ties f and g and any s <t we have 

\\K S)t (f) - K Stt {g)h < -(1 - 7a)*" s ll/ - g\\i, 

7a 

where r y a = c a /C a . 

Proof. As already mentioned, we write K s j as the composition of one 
Bayes operator and t — s Markov operators. The likelihood in the Bayes 
operator is equal to the conditional density of y s +i-.t given x s . It satisfies 
the recursion 



P(y s +i:t\x s ) = J a s+ i(x s ,x s+ i)b s+1 (x s+1 ,y s+1 )p(y s+2 :t\x s+ i)dfi(x s+ i). 

Hence, by conditions (19) and (20) and an induction argument, 

, 22 , suY> Xs p{y s +i:t\x s ) < J_ 

vaZ Xa p(y 8 +\;t\x 8 ) ~ 7a' 

which is, by Lemma 7, the maximal expansion by the Bayes operator. The 
Markov operators have transition densities 

a r (x r -i, x r )b r (x r ,y r )p(y r+ i : t \x r ) 
p[x r \x T —\,y r -t) — , , 

p{yr:t\Xr-l) 
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which are bounded below by 

h r (x r )b r (x r ,y r )p(y r+1 ; t \x r ) 
a Jh r (x r )b r (x r ,y r )p(y r+ i-t\x r )dn(x r )' 

The right-hand side is 7 a times a density that does not depend on x r -\. 
Hence, by Lemma 6 and (14), each Markov operator contracts at least by 

(l"7a). □ 

Theorem 2. Assume that the transition densities at are the same for 
all t, that they are continuous in the L\-norm and satisfy (19), and that 
(21) holds. Then to any e > 0, there are constants c\ and C2 such that, for 
all t and all N, 

I>[\\ft\t- ft\th> £]<ciexp(-c 2 N). 

Proof. Because at and thus also A* are the same for all t, we drop the 
time index during this proof. Let e > be given. Choose k such that 

la 

Assume first that k < t. Because the Li-distance between densities is at 
most 2, we obtain, in this case from the decomposition (18) with s = t — k 
and Lemmas 10 and 7, 

\\ft\t ~ ft\t\\i 

<^ E (l-la) t ~ r \\B(A*E N (f r N _ llr _ 1 ),b r ) 

> a r =t-k+l 

-BC^C/^i^AJIIi + e 
<— £ (l-~? a ) t - r \\A*E N (f r N llr „ 1 )-A*f r l\\ 1 +e. 

' a r=t-k+l 

If k > t, we obtain a similar result by considering the decomposition (18) 
with s = 0. (Because f^ = / |o = a o, the e at the end is then absent.) Hence, 
if 

(23) sup \\A*E N (C)-A*fZ\\ 1 <e 

t-k<r<t 

holds, then, by the formula for a geometric series, 

\\f t %-f tlt \\i<(C bl - 2 + l)s. 

We are now going to bound the probability that (23) occurs. Note that e 
and thus also k are fixed. Because of Lemma 8, all we need to show is that 
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the set of distributions (f^ r d/j,) is tight. By the definition of f^ r and by the 
conditions (19) and (21), we have 

f N / \ J2jLi a ( x j,r-i,x)b r (x,y r ) „ „ , , > 

J r \ r \X) = T7 — < CbC a h{X). 

J2j=t I a(xj,r-i,x)b r (x, y r ) dfi(x) 
Clearly this implies the desired tightness. □ 

The important feature of the above theorem is that the same N works 
for all times t. By Bonferroni's inequality, we obtain 



sup||/ t u-/ t |t||i >£ 

t<T 1 



< Tc\ exp(— C2-/V). 



Hence, it is sufficient to let N increase logarithmically with the length of 
the series to guarantee uniform convergence of the filter approximation at 
all time points. It is not difficult to extend the above theorem to cases where 
the state transitions depend on t as long as the continuity is uniform in t. 
Condition (21) is used in the proof for bounding 

\\B(A*E N (f r l),b r+1 )-B(A*f r f r ,b r+ i)\\ 1 

by applying Lemmas 7 and 8. The following lemma provides a direct way to 
bound the above distance by imposing only conditions on a, but assuming 
a compact state space. 

Lemma 11. Let a be a transition density on a compact state space that 
satisfies (19) and 

(24) A(x',x):= S n P la ^ X "l~^ X '^" )l ^0 [d(x,x')^0] 
x" n{x") 

with the same density h as in (19). Then under i.i.d. sampling from g, 

J>[\\B(A*E N (g),b)-B(A*g,b)\\i>£} N -^?0 

exponentially fast in N for any e > 0, uniformly over all densities g and all 
likelihoods b with < / h(x')b(x') dfi(x') < oo. 

Proof. To make the notation more compact, we introduce 

Then q(x',x) is again a transition density and we can write 
B(A*E N (g),b)(x)=J2 ^N J 
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and 



B{A*g,b){x) 



g(x')f3(x') 



-—q{x\x) dn(x'). 



f g(x")f3{x")dii(x") 
The difference between these two expressions can thus be decomposed as 

1 /l^ f \ Y$-iP(xi)q(x h x) 



fg(x)P(x)dpt(x) \Nfr{ 



N 



+ 



fg(x)0(x)dfi(x) {Nf^ 1 

g(x')p(x')q(x',x)d^x')). 



By assumption (19), we have 

la < 



P(x) 



fg(x)/3(x)dfi(x) 
Hence, it follows by Hoeffding's inequality that 



< 7 " 1 . 

— la 



N 



P(Xj) 



J^ 1 fg(x)P(x)dfj,(x) 



1 



> e 



<2exp(-2Ne 2 Y a /a-r a 



Because the Li-norm of J2j P( x j)q(xj , x) / J2j P( x j) is one > we have the same 
bound for the probability that the Li-norm of the first term is greater than 
e. 

Assumption (24) allows us to control the continuity of x — > (3(x)q(x, •) = 
a(x, •)&(•) with respect to the Li-norm: 

\\a(x,-)b(-) - a(x' rW-)]]! < A(x,x') f h(x")b(x") dfi(x"). 



Hence, the same argument as in Lemma 8 can be used to prove an expo- 
nential bound for the probability that the Li-norm of the second term is 
greater than e. □ 



By looking at the proof of Theorem 2, this lemma implies immediately 
the following: 



Theorem 3. The claim of Theorem 2 is valid if the state space is com- 
pact, the transition densities do not depend on t and (19), (20) and (24) 
hold. 
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4.3. Central limit theorems. The goal of this section is to show by a 
simple induction argument that 



( 1 N f 

\ 7=1 



0<s<t 



is asymptotically centered normal for any fixed t, any y\ ■ t and functions i/j s , 
< s < t, which are square integrable w.r.t. f s \ s . Del Moral and Miclo ([8], 
Corollary 20) have obtained a similar result, but we do not assume the ^ s 's 
to be bounded nor the likelihood b t (-,yt) to be bounded away from zero. 

Our argument proceeds by induction on the number t of time steps. For 
t = 0, the result is obvious because (xj,o) is an i.i.d. sample from / |o = ao- 
The key idea for the induction step is to condition on (xj t t-i). We first 
explain the argument heuristically. Introducing the notation 

1 N 

.,;;«,«,,,, 



m t(ip)=J ip(x)f t \ t (x)dfi(x), 

we can split 

y/N(M Ntt (il>)-mt(sl>)) 

(25) 

= VN (M N>t (ip) - mjv,t(^)) + Vn (m N , t (4>) - m t {^)). 

We assume that, conditionally on all samples up to time t — 1, (xj t t) is an 
i.i.d. sample from Then the first term in (25) has the conditional limit 

distribution N{Q,a 2 N t (ip)), where 

o-%,tW = J {^{x) ~ m N} t(ip)f ft\ t (x) dfj,(x) 

~ = J W>0*0 - m t {ip)) 2 f t \ t {x) dfi(x) 

if f^ t converges to f t \ t . By the recursions for f t \ t and f^ t , (l)-(2) and (6), 
respectively, 



J2j L t l(xj,t-i) rn t -i(L t l) 



(26) VN(m N ^) - mtW) =Vn( 
where 

L t ip(x t -i)= / a t (x t -i,xt)bt(xt,yt)ijj(xt)dLi(xt). 



RECURSIVE MONTE CARLO FILTERS 29 

Asymptotic normality of the second term of (25) follows therefore from the 
induction assumption and the delta method. 
We now state and prove a rigorous result. 

Theorem 4. If x — > at(x, •) is continuous and if for all t, all x and all 

y, 

0<b t (x,y)<C{t,y)<oo, 
then for all t, all y\-t and all functions tp with 

°t (VO = / OO) - m t (tp)) ft\t{x) dfi{x) < oo, 



the recursively defined asymptotic variance 

v \yt\yx:t-i) 

is finite. Moreover, if a 2 ^ s ) < oo for s = 0, 1, . . . ,t, then the vector y/~N x 
(M]sf tS (ifj s ) — m s (i( ) s))s=o,...,t converges in distribution to aM(0, (V^ s (^/v, ?/><;))) 
random vector, where 

Vr^lpr^t) = V rit -l(lpr,L t (l(} t - m t {lpt))) 

forr<t and V t ,t^u<i>t) = (W* + <t>t) ~ Wi) - V t (<p t ))/2. 

Proof. Using the Cramer- Wold device, it is sufficient to show that 

t 

Z N = Vn^MnA^s) - m s (i(>s)) 

s=0 

is asymptotically centered normal with variance 

i 

T 2 = V r)S (lp r ,1p s ). 
r,s=0 



For t = 0, the theorem is trivially satisfied, and for the induction argument, 
we decompose = + Zjy , where 

Z$ = sfN{M N ^t) ~ m N ,tU>t)) 



and 

_t-l 

Z {2) = y/N{m N , t tyt) ~ mtCk)) + v^V^(M WjS (^) - m,(^)). 

s=0 

We first assume that ipt is bounded. Denoting by Tt the a-field generated 
by the (xj jS ; 1 < j < N, < s < t), we can write 

E[exp(iAZjv)] = E[E[exp(iAzJ ) )|J^_i] exp^AZJJ )]. 
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Since conditionally on Ft-i the Xjj's are i.i.d., we have 



E[exp(iAZ<J ) )|.F t _ 1 ] = (E 



N 



Furthermore, by a Taylor expansion of exp(iu), 



E 



i-j={il>t{x\,t) ~ mN,t{4>t)) 



< 



t-i 



1 + 



2N 



|A| 3 sup |-0t(^)| 3 



6AT3/2 

Similarly, because 1 — u < exp(— u) <\ — u + u 2 for all u > 0, 

A 4 sup|V>t(a;)| 4 



X 2 a% Jib t ) „ _ 

1 ~ ex P (-A 2 ^ >t (^)/(2iV)) 



< 



4N 2 



Because \u N — v | < iV|u — v\ for |ii| < 1, \v\ < 1, we therefore obtain that, 
for any A, 



E[exp(fAzS ) )|^„ 1 ] - ex P (-AV^(^)/2) 

converges to zero as N — > oo uniformly. By Theorem 1, ||/^ — con- 
verges to zero for N — ► oo. Because ^ is bounded, this implies that <r% t (ipt) 
converges to a 2 (ipt)- Therefore, 

E[| exp(-AV^(^)/2) - exp(-A 2 <7^)/2)|] ^ 0. 

(2) 

We now turn to the second term, Z N . The conditions of the theorem 
guarantee that 



mt-\(L t l) 



ft-i\t-i(x t -i)at(x t -i, x t )bt(x t ,yt) dfi(xt-i) dfi(x t ) 

=p(yt\yi-.t-i) 

is strictly positive, and Ltipt an d Lfl are easily seen to be bounded if ipt is 
bounded. Hence, the conditions for the delta method are satisfied, and so 
vN(mNt (ipt) -mt(^t)) is asymptotically equivalent to 



1 



Np(yt\yut-i) 



^2(L t tpt{xj,t~i) - m t -i{L t ip t )) 
< j 

- m t (ipt) ^2(L t l(xj !t -i) - m t -x(L t l)) J . 

.7 / 
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This is equal to \fN(M Ni t-x{<t ) t-i) ~ m t -i((t>t-i)) , where cp t -i = L t (ip t - 

(2) 

mt(ipt)) /p{yt\yi ■. t-i) • Hence, by the induction assumption, E[exp(iAZ] v ; )] 
converges to 

/ X 2/t-2 



exp( — -( Vr,s(A,A 



r,s=0 
t-2 



+ 2j2V S)t -i(^ s ,tpt-i +(f>t-i) + V t - 1 (4>t-i +(f>t-i)JJ, 

which is equal to exp(— A 2 (r 2 — cr^ (V't)) /2) because V^t(v) is bilinear. 
Taking all this together we obtain that, for bounded ipt, 

\E[exp(i\Z N )] -exp(-A 2 r 2 /2)| 

< E[|E[exp(iAZj|} ) )|.F t _i] - exp(-A 2 a 2 (^)/2)|] 
+ |E[exp(u4 2) )] -exp(-A 2 (r 2 - a 2 (^))/2)| 

converges to zero. 

The last part of the proof deals with the case when ipt is unbounded. 
We show first that crt(tpt) < oo implies Vt(tpt) < °°- Again we use induction. 
For t = 0, this is clear because (Tq(V') = Vq(^). For the induction step, it is 
sufficient to show that at~i(L t (ipt — mtfyt))) < oo because, by our assump- 
tions, p(yt\yi ■. t-i) > 0. By Cauchy-Schwarz, L^tp < L t (^ 2 )L t l, and by our 
assumption, L^l < C(t,yt) is finite. Hence, by the definition of Lt and the 
recursions (l)-(2), 

ffw(i((^t-mfW)) < mt-i(Lf(ipt - m t (tp))) 

<C(t,y t )m t ^(L t ((t/; t -m t ^)) 2 )) 

= C(t,y t )p(yt\yi:t-i)crt(^t) < oo. 
For the asymptotic normality, we use a truncation argument. We set 
i>t,c(x) = ^t{x)t{\^ x) \< c }, ip t ,c( x ) = ~ ^V(aj). 
Because Vt(tpt) < oo, it follows by dominated convergence that 

(27) Vr tt {ll)r,l!>t,c) Vrjilpr^t)- 

Next, we are going to show that 

(28) Jim ]imnxpP[VN\M N t@ ttC ) - m t (^ c )\ > e] = 0. 
We first condition on Tt-\- By Chebyshev's inequality, 

Pr[y/N\M N , t @ t>c ) - m t ® t>c )\ > 

4 ,-r2 

e 2 
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We therefore have to study the expectations of the two terms on the right. 
By (26), 



N(m N , t (il>t,c)-rnt(il>t,c)) = vN 



Lc i 



Y^j L t l(xj )t -i) mt-x(L t l) 

which by the induction assumption is asymptotically M(0, Vt-i(Lt{^> tc — 

m i(V ; t,c))))"distributed. For c— ► oo, this variance goes to zero, implying the 

rN 



desired behavior of the first term. By the recursion for /JY 



}2j L t l(xj : t-i) 

which, by the induction assumption, converges in probability to 
J ipt,c( x )ft\t( x ) dfi(x). Hence, by dominated convergence the second term also 
has the desired behavior, and, thus, (28) follows. 

Now we have all the ingredients to complete the proof. We write 

_t-i 

Z N ,c = ^/N^2(M N ^ S ) -m a (ij>,)) + (M N)t (i) t ,c) ~ mt(if>t,c)) 

and r 2 for the asymptotic variance of Zn,c- Then 
|E[eoq)(iAZiv)] - exp(-A 2 r 2 /2)| 

< |E[exp(iAZjv iC )] -exp(-A 2 r 2 /2)| 
+ | exp(-A 2 r 2 /2) - exp(-A 2 r 2 /2)| 

+ E[|exp(iA\/]V {M N>t {iJjt,c) ~ rn t (ipt,c) - M N ^) + m t (V))) - 1|]. 

By (27), the second term is arbitrarily small if c is large enough. Using 
|exp(m) — 1| < min(2, \u\) and (28), the same thing holds also for the last 
term, uniformly in N. Finally, the first term goes to zero for any fixed c as 

N ->-oo. □ 

4.3.1. The asymptotic variance. Similarly as in the case of convergence 
of fK, one would like to know whether the asymptotic variances Vt(ip) stay 
bounded as t increases. Using ideas from [3], we show that this is the case 
if ip is bounded and the condition (19) is satisfied. Because m t ~i(L t ip) = 
nk(ip)p(yt\yi:t-i), we have 

mt-i(L t (ip - m t {i>))) = 0. 

Hence, by iterating the recursive definition of Vt(ijj), we obtain 

* al^(L s .. t ^-m t m) 



(29) W)=<W) + 



p 2 (y s -.t\yi: 
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where 

L s:t ip(x s -i) = / 4i(x t )W_a r (x r -i,x r )b r (x r ,y r )dn(x r ) 

r=s 

= E[i/j(X t )\x s - 1 ,y s: t\p(ys:t\x s -i)- 

Here, the expectation is with respect to the state space model and not with 
respect to the random sampling in the Monte Carlo filter. Thus, 

L s:t {ip-m t (ip)) = (E[ip(X t )\x s -i,y s: t]-'E[^(X t )\yi :t ])p(y s . t \x s - 1 ). 

Because p(y s: t\yi: s-i) = Jp{.ys:t\x s -i)f s -i\s-i{xs-i\yi:s-i)dfjL(x s - 1 ), it fol- 
lows from (22) that 

p{y S :t\x s -l) J_ 



P(ys:t\yi: s-l) la 
Moreover, condition (19) implies uniform contractivity of the conditional 
chain given compare Lemma 10. Hence, under condition (19) we have 



t-s+l 



2 



|E[^(X t )|s a _i,y a:t ]-E[^(X t )|i/i:t]| < [sup^(x)-inf^(x))(l-7 a ) 
and, therefore, 

(30) V t (ip) < 7~ 3 (supV(x) -iniip(x 

\ X x 

So far, we have dealt with the case where (xjj) is an i.i.d. sample from f^ t , 
usually generated by an accept-reject method. For sampling importance re- 
sampling, asymptotic normality can be proved by a similar recursive ar- 
gument; see [3]. However, the formula for the variance Vt changes slightly. 
Random resampling leads to the recursion 

p 2 [.yt\yi:t-i) 
l 

p 2 (yt\yi-.t-i)' 

(The second term comes from the resampling step and the third from the 
reweighting.) Using again m t -i(L t ip) = mt(ip)p(yt\yi;t-i), we obtain 



+ -,mt-i{L t {bt{-yt){i> ~ m t ^) f ) - L 2 {^ - m t (Y>))). 



W) = a t (ip) + 



(31) + 



p 2 (yt\yi-.t-i) 

m t {b t {ip - mtjip)) 2 ) 

p(yt\yi:t-i) 

+ y- m s{b s L 2 s+l . t {4) - m t (il)))) 
~! p(y s \yi : s -l)p 2 (y s +l:t\yi:s) 
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[we set Lt+i-.t(ip) = tp]- Using Cauchy-Schwarz, one can show that each 
summand in (31) is always greater than or equal to the corresponding term in 
(29) and, thus, the additional effort of generating an i.i.d. sampling reduces 
the variance. 

Because of the slightly different form of the asymptotic variance, one also 
needs additional conditions in order that Vt(if)) in (31) remain bounded uni- 
formly in t. Using the previous bound for (L( s+ i) : t (i> - m tW))/p(y s +i :t\yi:s), 
one needs, in addition, a bound for 

m s (b s ) J f s -i\ s -i{xs-i)a s (x s -i,x s )bl(x s ,y s ) dfi(x s -i) dy(x s ) 

p(y s \yi ■. s-i) (J f3-i\a-i(x s -i)a s (x s - 1 ,x s )b s {xs,y s ) dp,{x s -i) dn{x s )) 2 ' 

Obviously, this is bounded uniformly in s and y under the condition (21). 
Using assumption (19), we can replace (21) by a slightly stronger version of 
(20), namely, that 



for all s and all y s . However, the bound for Vt then depends on yi-t- 

4.4. High rate sampling. So far, we have worked with a fixed sampling 
rate which we set equal to one for simplicity. Alternatively, we can consider 
what happens when the sampling rate converges to zero. We discuss this 
case briefly in this last section. So we let (Xt) be a Markov process in 
continuous time, and we assume, for simplicity, that it is time homogeneous 
with transition kernels P[Xt+ s G dx\Xt = x'\ = a(s,x',x) dfj,(x). We consider 
the sampling rate 5 = 1/m with m £ N, and we assume that, for a given m, 
we have conditionally independent observations Yjs,j = 1,2,..., such that 
Yjg depends only on Xj$. 

In the previous two subsections we showed how the strong condition (19) 
allows one to obtain convergence results that are uniform in t and require 
essentially no conditions on the observation densities. Unfortunately, this 
strategy breaks down in the high rate sampling limit. In continuous time, 
the analogue of (19) is 



for some fixed h and all t, x, x' . It is easy to see that if the lower bound c a (t) 
is of larger order than t as t — > 0, then ||af (x, •) — <h{x\ -)||i =0 for all t > 0. 
Hence, except for degenerate cases, the crucial quantity 7 a diverges at least 
like <5 -1 for 5^0. Moreover, the continuity module of x — > a(S,x, •) which 
is used in Lemmas 8 and 9 also diverges. Because the asymptotic variance 
Vt(t/j) in (29) is exact and does not depend on Lemmas 8 and 9, it is slightly 
easier to study the behavior of Vt(V0 as 5 ^ 0, and we concentrate on this. 



(32) 




(33) 



c a (t)h(x) < a(t, x', x) < C a (t)h(x) 
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Even this is not trivial. The simplest case occurs if the state space is finite 
and all jump probabilities are positive. Then it is easily seen that (33) holds 
with c a (t) = c a t, c a > some constant, and C a (t) = C a . Inserting this into 
the bound (30) for the asymptotic variance, we obtain an upper bound for 
the asymptotic variance of the order m 3 which is not satisfactory. Of course, 
our bounds are presumably not sharp, but it is not obvious how to improve 
them in general. We believe that the behavior in the high rate sampling 
case depends on the properties of the observation process. If we have a fixed 
observation density and we increase the sampling rate, we accumulate more 
and more information about the state process in any fixed interval, and the 
filter distribution will converge to a point mass except near the times where 
a jump occurs. With high rate sampling, it is somewhat more natural to 
let the information that is carried by a single observation decrease with the 
sampling rate. Then we need additional superscripts for the observations 
and their densities. The standard example is 

(34) \X j5 = x~ M(5g(x),5a 2 ), 
and we will study this case. Then the partial sum process 

(35) 4 m) = E 4 m) 

jS<t 

converges for m — > oo in distribution to the process rjt = Jq g(x s ) ds + aWt- 
If t is fixed and the sampling rate increases, the formula for Vt contains 
0(m) summands. Moreover, with (34), the filter distributions are not de- 
generate and the function x s _s E[ip(Xt)\x s -5, y s -.t\ does not converge to a 
constant. Hence, we expect that, for fixed t,Vf(ip) is of the order m. This is 
not surprising: At each time step 5 we take a new sample even though the fil- 
ter distribution changes very little. The sampling errors accumulate because 
the filter does not forget its initial condition over a finite time interval. In 
this setup, it is much better to use sequential importance sampling, that is, 
to carry the weights forward by multiplication instead of resampling at each 
time step. We thus generate a sample (xj,k6i k = 0, 1, ... ) from our model of 
the state process and compute the weights A^l^xj^. f-s) sequentially, where 

Atl(^:ks) = f[b^(x e5 ,y^) 

e=i 

is the likelihood. Then 

M N,ksW = — v (m) " 



3G 



H. R. KUNSCH 



is an asymptotically normal estimator of mks(tp) with asymptotic variance 



(Ex indicates that the expectation is only with respect to the state variables, 
the observations are considered to be fixed). We show first that, for a fixed 
time kS, this variance remains bounded as the sampling rate increases to 
infinity. We assume the state process to be a diffusion, 



where (B t ) is a Brownian motion. 

Theorem 5. Consider the state space model {Xj^,Y^) defined by (36) 
and (34), where f , a and g, together with their first and second derivatives, 
are all bounded. Assume, moreover, that the partial sum process (35) con- 
verges in the sup-norm to a continuous function rj. Then for 5 — > and 
k5—>t, t fixed, the asymptotic variance Vks(^p) of the importance sampling 
estimator is bounded by sup x ip 2 (x) times a constant that depends only on t 
and the supremum of \n\ on [0,t]. 

Proof. In order to simplify the notation we assume that a = 1. More- 
over, we put 



(E x [A { ™l(X S ; k sW 



(36) 



dX t = f{X t )dt + d{X t )dB t 




Then we can replace the observation density b in the likelihood A^" 1 ) by b. 
Summation by parts gives 




k-i 





The Ito formula implies 



= / Dg{x t )a{x t )dB t 
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Because /, a, g, Dg and D 2 g are all bounded, it follows that 

logA<?>(x i:M ) + [ kS 4 m) Dg(x t )a(x t )dB t 
Jo 

is bounded above and below by constants that depend only on k5 and 
sup{|?? 4 (m) |;0<t<fc$}. Finally, again by the Ito formula, 

E exJj™ 4 m) Dg(X t )a(X t )dB t } 

= E[exp(-i £ 5 rii m)2 (Dg(X t )a(X t )) 2 d?j , 

which is again bounded above and below by constants that depend only on 
k8 and the supremum of ^"^l- Using this, the rest of the proof is obvious. 

□ 



Finally, one can look at the case where both the sampling rate m and the 
time index t of the filtering distribution increase. We show that in this situ- 
ation the asymptotic variance remains bounded if we resample at fixed time 
intervals which for simplicity we take equal to one. Accept-reject meth- 
ods at a fixed rate cannot be used in this case because the supremum of 
A/ m \w , i m (x s -i+s- ,) diverges as m increases. 

By similar arguments as before, the asymptotic variance at time t E N of 
this version of the particle filter is 



(38) 



s=l 



+ J2 [m._l (E X [A^m+l : sm ( X s-1 + S : s ) 

xL^l 2 t (iP-m t (mXs)\x s -i]) 



(m) 



As above, we replace the observation density b in the likelihood A^" 1 ) by b 
since this has no effect on the right-hand side of (38). We define 



, ( m ) 



fc-1 

x Yl a{5,x^_ 1)s ,Xis)dn(x iS ) 
i=j+i 
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We then assume that there exist a probability density h and two functions 
c and C — ► M+ such that, for all xjs, 

c((j - k)S, MH (jS, kS)) < < - k)8, MH fctf)) , 

(39) 
where 

M( m )(s,t)= sup -vi m) \- 

s<u<t 

It follows from the proof of Theorem 3 in [1] that the assumption (39) is 
satisfied in the case where the state process is a diffusion on a compact 
Riemannian manifold with strictly elliptic generator. 
Because for j < I < k 

J jTk ( x j8,x kS )= J jf""} (x jS ,Xis) jjFjj (x t 8,x kS )dn(xt 5 ), 

assumption (39) implies that, for k — j > to, 

P(ytj+i)5:ks\ x js) _ JJ- m k(xj5,Xk6)dfi{x kS ) C(l,M^ m \j5,j5 + l)) 
P(ylTli)S: k s\ x ' 3 s) " /^72(^,x M )dM(x M ) " c(l,MM(j5,j5 + l)) " 
Similarly, 

p(x s i\x -8 y (m) ) = - J ^+ m ^ Xj5,XjS+1 ^ J ^+ )m -- k ^ Xj5+1,Xk5 " > d ^ XkS "> 
3 + 3 (j+1)5 - kS J Jj m k \x jS ,x k5 )dfi(x kS ) 

is bounded below by 

c(l, M( m ) (jSJ6 + 1)) h(x jS +i)J J^.^Xjg+uXks) dn(x kS ) 
C(l, MM (jS,jS + 1)) j h{XjS+l) j(^ m k ( XjS+u Xks) dfi(x jS+ i)dti(x kS )' 

The second ratio on the right-hand side is a probability density which does 
not depend on xjs, and, thus, one minus the first ratio is a bound for the 
contraction rate of the conditional chain. Therefore, we have 

l-ffiL^-rafMX^)! < sup x ^(x) -inf x ^(x) 

P{ys+S:t\yS:s) H ' 

where 

c(l,M( m )(r,r + l)) 
7 ^ j ~ C(l,MM( r ,r + l))' 

In contrast to the case with fixed sampling rate, these coefficients depend on 
the observations, that is, they are random. For nonrandom bounds that hold 
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with high probability, we would have to assume that the limiting process r\ 
has stationary increments. Finally, we can bound 



by similar arguments as before. 
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